Introduction

library(tidyverse)
library(lubridate)
library(scales)
library(matrixStats)
library(ggrepel)
library(pander)
library(plotly)
panderOptions("big.mark", ",")
panderOptions("table.split.table", Inf)
panderOptions("table.style", "rmarkdown")
theme_set(theme_bw())

Disclaimer: This very simple report was prepared by a bioinformatician with no experience in epidemiology or virology, and as such should be treated simply as an alternate viewpoint on the data, which I was simply unable to find elsewhere. Many other people exist with much greater expertise on this subject. However, I do hope this provides a useful perspective which is able to add constructively to the wider discussion.

Data was sourced from Johns Hopkins University (https://coronavirus.jhu.edu/), using the datasets provided by JHU at https://github.com/CSSEGISandData/COVID-19. JHU data is updated every 24 hours at approximately 23:59(UTC), which is about 10:30AM in Adelaide. As a result Australian numbers may lag local media reports.

Additionally, the official government figures at www.health.gov.au appear to lag media reports, likely to ensure accuracy of the official numbers. The official Australian numbers are not used for this analysis, instead relying on those provided by JHU. However, JHU are extremely careful and comprehensive in their sourcing of numbers and significant disparity is not expected. Live updates for Australia are available from https://covid-19-au.github.io/ for those who would like an up to the minute breakdown of confirmed cases.

confirmed <- url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv") %>%
    read_csv() %>%
    pivot_longer(
        cols = ends_with("20"),
        names_to = "date",
        values_to = "confirmed"
    )  %>%
    mutate(
        date = str_replace_all(
            date, "(.+)/(.+)/(.+)", "20\\3-\\1-\\2"
        ) %>%
            ymd()
    ) %>%
    dplyr::rename(
        Country = `Country/Region`
    ) %>%
    dplyr::mutate(
        Country = case_when(
            grepl("China", Country) & `Province/State` == "Hubei" ~ "China (Hubei)",
            grepl("China", Country) & `Province/State` != "Hubei" ~ "China (Other)",
            !grepl("China", Country) ~ Country
        )
    ) %>%
    dplyr::filter(
        !is.na(confirmed)
    ) 

Population sizes for the most impacted countries were manually obtained from https://www.worldometers.info/ and should not be considered as authoritative. Given the disparity of infection within China, China was broken into Hubei Province and the rest of China. As an open acknowledgement of the crudeness of population values, population estimates for Hubei Province were taken from the 2018 estimates given by Statista.com. This particular population estimate is likely to be low, and as a result confirmed case rates for Hubei province may be an overestimate, and consequently, confirmed case rates for the rest of China may also be an underestimate.

Confirmed cases of COVID-19 as provided by the Chinese Government have been discussed elsewhere as unusual, and data appears potentially unreliable. In this analysis, discussions regarding accurate Chinese reporting are not considered further and data is simply presented as supplied by JHU.

However, all countries are likely to contain many unreported cases given the incomplete testing regimes in place for most countries. Similarly, reporting in many countries may have features that cause concerns regarding data integrity and this makes comparison across countries difficult.

hb <- 59170000
pops <- tribble(
    ~Country, ~Population,
    "China (Hubei)", hb,
    "China (Other)", 1408526449 - hb,
    "Italy", 60486925,
    "Iran", 83677594,
    "Spain", 46754778,
    "Germany", 83783942,
    "Korea, South", 51269185,
    "France", 65273511,
    "US", 331002651,
    "Switzerland", 8654622,
    "United Kingdom", 67886011,
    "Netherlands", 17134872,
    "Norway", 5408930,
    "Austria", 9006398,
    "Belgium", 11575214,
    "Sweden", 10081035,
    "Denmark", 5786274,
    "Japan", 126476461,
    "Malaysia", 32245488,
    "Canada", 37742154,
    "Australia", 25499884,
    "Portugal", 10196709,
    "Qatar", 2866531,
    "Czechia", 10703010,
    "Greece", 10423054,
    "Israel", 8615601,
    "Brazil", 212125368,
    "Finland", 5538181,
    "Singapore", 5836728,
    "Pakistan",  219629013,
    "Ireland", 4921810
)

Results & Discussion

Cumulative Incidence Rates

All information presented in the section below is effectively the cumulative, confirmed incidence rate. Recovered patients and those who have passed away are still included in these numbers. Testing rates and results are also not included, and these may be highly informative if able to be sourced accurately.

confirmed %>%
    group_by(Country, date) %>%
    summarise(confirmed = sum(confirmed)) %>%
    ungroup() %>%
    group_by(Country) %>%
    dplyr::filter(
        date == max(date)
    ) %>%
    ungroup() %>%
    right_join(pops) %>%
    mutate(
        rate = 1e6*confirmed / Population
    ) %>%
    arrange(desc(rate)) %>%
    dplyr::slice(1:25) %>%
    mutate(
        rate = sprintf("%.1f", rate),
        Population = comma(Population)
    ) %>%
    rename_at(vars(everything()), str_to_title) %>%
    rename(`Rate (Cases per Million)` = Rate) %>%
    pander(
        justify = "lrrrr",
        caption = paste(
            "The", nrow(.), "most impacted populations studied here and shown as a proportion of total population.", 
            "The final column provides the latest confirmed infection rate as cases per million people."
        )
    )
The 25 most impacted populations studied here and shown as a proportion of total population. The final column provides the latest confirmed infection rate as cases per million people.
Country Date Confirmed Population Rate (Cases per Million)
China (Hubei) 2020-03-18 67,800 59,170,000 1145.9
Italy 2020-03-18 35,713 60,486,925 590.4
Switzerland 2020-03-18 3,028 8,654,622 349.9
Spain 2020-03-18 13,910 46,754,778 297.5
Norway 2020-03-18 1,550 5,408,930 286.6
Iran 2020-03-18 17,361 83,677,594 207.5
Denmark 2020-03-18 1,115 5,786,274 192.7
Austria 2020-03-18 1,646 9,006,398 182.8
Korea, South 2020-03-18 8,413 51,269,185 164.1
Qatar 2020-03-18 452 2,866,531 157.7
Germany 2020-03-18 12,327 83,783,942 147.1
France 2020-03-18 9,105 65,273,511 139.5
Belgium 2020-03-18 1,486 11,575,214 128.4
Sweden 2020-03-18 1,279 10,081,035 126.9
Netherlands 2020-03-18 2,058 17,134,872 120.1
Finland 2020-03-18 336 5,538,181 60.7
Ireland 2020-03-18 292 4,921,810 59.3
Singapore 2020-03-18 313 5,836,728 53.6
Israel 2020-03-18 433 8,615,601 50.3
Portugal 2020-03-18 448 10,196,709 43.9
Czechia 2020-03-18 464 10,703,010 43.4
Greece 2020-03-18 418 10,423,054 40.1
United Kingdom 2020-03-18 2,642 67,886,011 38.9
Malaysia 2020-03-18 790 32,245,488 24.5
US 2020-03-18 7,783 331,002,651 23.5
minToInclude <- 5
startingPoint <- 1
nDays <- 50
p <- confirmed %>%
  group_by(Country, date) %>%
  summarise(confirmed = sum(confirmed)) %>%
  ungroup() %>%
  right_join(pops) %>%
  mutate(
    rate = 1e6*confirmed / Population
  ) %>%
  group_by(Country) %>%
  dplyr::filter(
    max(rate) > minToInclude
  ) %>%
  ungroup() %>%
  split(f = .$Country) %>%
  lapply(function(x, r = startingPoint){
    std <- min(dplyr::filter(x, rate > r)$date)
    x %>%
      dplyr::filter(date >= std) %>%
      mutate(days = date - std)
  }) %>%
  bind_rows() %>%
  mutate(
    days = as.integer(days),
    rate = round(rate, 2)
  ) %>%
  dplyr::filter(days <= nDays) %>%
  arrange(date) %>%
  mutate(Country = fct_inorder(Country)) %>%
  rename_all(str_to_title) %>%
  ggplot(
    aes(Days, Rate, colour = Country, Date = Date, Confirmed = Confirmed)
  ) +
  geom_point() +
  geom_line() +
  scale_y_log10() +
  xlab(
    paste(
      "Days since passing", 
      startingPoint, 
      "confirmed case/million"
    )
  ) +
  ylab("Confirmed Infection Rate (cases/million)")
ggplotly(p) 

Confirmed cases of COVID-19 for countries where the infection rate has exceeded 5 cases/million. Data is only shown for the first 50 calendar days since passing 1 confirmed case/million. Note that from the day records begin in this dataset, the confirmed infection rate in Hubei was 7.59 confirmed cases/million. To hide a country, click on the country in the plot legend. Clicking again on the country in the legend will restore the data within the plot. Countries are shown in order of the date at which they passed the 1 confirmed case/million mark. Regions of the plot are also able to be zoomed interactively. Please note the y-axis is shown on the logarithmic scale, so that a series of points which appear to be diagonal will indicate exponential growth. The flatter the line, the slower the growth and a perfectly horizontal line would indicate zero growth, or no new confirmed cases.

All figures and tables presented here simply aim to show an alternative viewpoint on the data. Every way to view COVID-19 data will mask important features, and the values shown here do not take into account vital factors such as population density, variability of infection across regions within countries, social culture and demographics. Many countries may not be directly comparable for a combination of the above factors. It is simply to view the data through the lens of a country’s population size using a value which should be easily interpretable.

In the above plot:

  • Only countries are shown where the infection rate has surpassed 5 cases/million. This was a completely arbitrary choice, but seemed likely to indicate a level of COVID-19 infection which posed noticeable community risk
  • The choice of “Days since passing 1 case/million” was also arbitrary, but seemed to be a useful way of enabling the comparison of COVID-19 progression across countries

Additionally, Australia’s spread of the virus appears slower than many other countries.

  • This may be a reflection of Australian geography or our tendency to have large personal space requirements
  • This may also be a reflection of the pre-emptive strategies already being taken by the population and government
  • Overall, the Australian infection rate brings mild relief that we are doing a little better than many other countries. However testing rates are far lower than in countries like South Korea and many genuine cases are likely to be undetected
  • Despite the above, Australia is still clearly in the exponential growth phase with no sign of the growth levelling out
  • The total percentage of the Australian population confirmed as infected currently sits at 0.0022%. The estimate of Australia’s population used for this calculation is 25,499,884, and whilst reasonably accurate, may not be authoritative.

Turning to other countries beyond Australia:

  • Only South Korea,and China appear to have brought the virus under some level of control. The infection rate currently sits around 164.1 per million for South Korea and 1145.9 per million for Hubei Province. This can also be thought of as 0.0164% and 0.1146% of the total populations respectively. Note that despite these apparently low percentages, the actual numbers of deaths and hospitalisations are extremely non-trivial and the real impacts of these numbers cannot be treated lightly.
  • Japan’s early measures also appear to be highly effective, however the recent growth in infections within Singapore should serve as a caution
  • Some of the above countries (e.g. Norway, Denmark) appear to be in the early stages of levelling out, indicating that the initial exponential growth phase may be ending in these countries.
  • Other countries such as Italy and the UK are clearly still in an exponential growth phase.
  • US numbers may also be unreliable given the low rates of testing and numerous anecdotal stories of suspected infections

Currently Active Infections

As an alternative viewpoint, the numbers of recovered and deceased patients have been removed from the above plot to provide an estimate of the currently active infections.

recovered <- url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv") %>%
    read_csv() %>%
    pivot_longer(
        cols = ends_with("20"),
        names_to = "date",
        values_to = "recovered"
    )  %>%
    mutate(
        date = str_replace_all(
            date, "(.+)/(.+)/(.+)", "20\\3-\\1-\\2"
        ) %>%
            ymd()
    ) %>%
    dplyr::rename(
        Country = `Country/Region`
    ) %>%
    dplyr::mutate(
        Country = case_when(
            grepl("China", Country) & `Province/State` == "Hubei" ~ "China (Hubei)",
            grepl("China", Country) & `Province/State` != "Hubei" ~ "China (Other)",
            !grepl("China", Country) ~ Country
        )
    )
deaths <- url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv") %>%
    read_csv() %>%
    pivot_longer(
        cols = ends_with("20"),
        names_to = "date",
        values_to = "deaths"
    )  %>%
    mutate(
        date = str_replace_all(
            date, "(.+)/(.+)/(.+)", "20\\3-\\1-\\2"
        ) %>%
            ymd()
    ) %>%
    dplyr::rename(
        Country = `Country/Region`
    ) %>%
    dplyr::mutate(
        Country = case_when(
            grepl("China", Country) & `Province/State` == "Hubei" ~ "China (Hubei)",
            grepl("China", Country) & `Province/State` != "Hubei" ~ "China (Other)",
            !grepl("China", Country) ~ Country
        )
    )
minToInclude <- 5
startingPoint <- 1
nDays <- 50
p2 <- confirmed %>%
  dplyr::filter(
    !str_detect(Country, "Other")
  ) %>%
  left_join(recovered) %>%
  left_join(deaths) %>%
  mutate(
    active = confirmed - recovered - deaths
  ) %>%
  group_by(
    Country, date
  ) %>%
  summarise(
    active = sum(active),
    confirmed = sum(confirmed),
    recovered = sum(recovered),
    deaths = sum(deaths)
  ) %>%
  right_join(pops) %>%
  mutate(rate = 1e6*active / Population)  %>%
  dplyr::filter(
    max(rate) > minToInclude
  ) %>%
  ungroup() %>%
  split(f = .$Country) %>%
  lapply(function(x, r = startingPoint){
    std <- min(dplyr::filter(x, rate > r)$date)
    x %>%
      dplyr::filter(date >= std) %>%
      mutate(days = date - std)
  }) %>%
  bind_rows() %>%
  mutate(
    days = as.integer(days),
    rate = round(rate, 2)
  ) %>%
  dplyr::filter(days <= nDays) %>%
  arrange(date) %>%
  mutate(Country = fct_inorder(Country)) %>%
  rename_all(str_to_title) %>%
  ggplot(
    aes(
      x = Days, y = Rate, colour = Country, 
      Date = Date, Active = Active,
      Confirmed = Confirmed, Recovered = Recovered,
      Deaths = Deaths
      )
  ) +
  geom_point() +
  geom_line() +
  scale_y_log10() +
  xlab(
    paste(
      "Days since passing", 
      startingPoint, 
      "confirmed active case/million"
    )
  ) +
  ylab("Confirmed Active Infections (cases/million)")
ggplotly(p2) 

Confirmed active cases of COVID-19 for countries where the confirmed infection rate has exceeded 5 cases/million. Data is only shown for the first 50 calendar days since passing 1 confirmed case/million. Due to difficulties introduced by the currently reported low active infection rate outside Hubei province, data from China has been excluded from this plot, with the exception of Hubei. To hide a country, click on the country in the plot legend. Clicking again on the country in the legend will restore the data within the plot. Countries are shown in order of the date at which they passed the 1 confirmed active case/million mark. Regions of the plot are also able to be zoomed interactively. Please note the y-axis is shown on the logarithmic scale, so that a series of points which appear to be diagonal will indicate exponential growth/decay.

Notable features of this perspective are:

  • No deaths from confirmed cases have been reported in Singapore. Despite recent increases in active infections, has “flattening the curve” been successful?
  • Active infection rates in South Korea are currently declining
  • According to these figures, active infections within Hubei province are also significantly declining

Comparison of Countries

In order to summarise which countries are the most similar to each other, a Principal Component Analysis was performed. This enables the multi-dimensional data of the above plots to summarised in two dimensions. As missing data cannot be included in this analysis, several countries which are at earlier comparative time-points than Australia were omitted from this analysis.

nDays <- 17
pca <- confirmed %>%
  group_by(Country, date) %>%
  summarise(confirmed = sum(confirmed)) %>%
  ungroup() %>%
  right_join(pops) %>%
  mutate(
    rate = 1e6*confirmed / Population
  ) %>%
  group_by(Country) %>%
  dplyr::filter(
    max(rate) > minToInclude
  ) %>%
  ungroup() %>%
  split(f = .$Country) %>%
  lapply(function(x, r = startingPoint){
    std <- min(dplyr::filter(x, rate > r)$date)
    x %>%
      dplyr::filter(date >= std) %>%
      mutate(days = date - std)
  }) %>%
  bind_rows() %>%
  mutate(
    days = as.integer(days),
    rate = round(rate, 2)
  ) %>%
  dplyr::filter(days <= nDays) %>%
  dplyr::select(-date, -confirmed, -Population) %>%
  pivot_wider(
    values_from = rate,
    names_from = days
  ) %>%
  as.data.frame() %>%
  column_to_rownames("Country") %>%
  as.matrix() %>%
  .[!rowAnyNAs(.),] %>%
  prcomp(scale = TRUE, center = TRUE) 
pca$x %>%
  as.data.frame() %>%
  rownames_to_column("Country") %>%
  ggplot(aes(PC1, PC2, label = Country)) +
  geom_point() +
  geom_text_repel() +
  xlab(
    paste0(
      "PC1 (", 
      percent(summary(pca)$importance["Proportion of Variance","PC1"], accuracy = 0.1), 
      ")"
      )
  ) +
    ylab(
    paste0(
      "PC2 (", 
      percent(summary(pca)$importance["Proportion of Variance","PC2"], accuracy = 0.1), 
      ")"
      )
  )
*Dimensional reduction showing which countries are most similar to each other at the 17 day mark. Countries which have not progressed beyond 1 case/million for 17 days are not shown. Principal Component 1, on the x-axis, corresponds to the greatest source of variability within the data (87.5%), and countries which appear near each other along this axis can be assumed to be showing similar growth in confirmed infection rates at this time point. Separation along the y-axis is less significant, but also worthy of note, as this represents 6.6% of variability within the data. At this early point, Australia's cumulative, confirmed infection rate is most similar to the countries which have responded well. However, viral control regimes introduced in Japan and Singapore may not be replicated in Australia, and as such, this may not be predictive of future outcomes beyond this specific time point.*

Dimensional reduction showing which countries are most similar to each other at the 17 day mark. Countries which have not progressed beyond 1 case/million for 17 days are not shown. Principal Component 1, on the x-axis, corresponds to the greatest source of variability within the data (87.5%), and countries which appear near each other along this axis can be assumed to be showing similar growth in confirmed infection rates at this time point. Separation along the y-axis is less significant, but also worthy of note, as this represents 6.6% of variability within the data. At this early point, Australia’s cumulative, confirmed infection rate is most similar to the countries which have responded well. However, viral control regimes introduced in Japan and Singapore may not be replicated in Australia, and as such, this may not be predictive of future outcomes beyond this specific time point.

R Session Information

R version 3.6.3 (2020-02-29)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_AU.UTF-8, LC_NUMERIC=C, LC_TIME=en_AU.UTF-8, LC_COLLATE=en_AU.UTF-8, LC_MONETARY=en_AU.UTF-8, LC_MESSAGES=en_AU.UTF-8, LC_PAPER=en_AU.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_AU.UTF-8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: plotly(v.4.9.2), pander(v.0.6.3), ggrepel(v.0.8.1), matrixStats(v.0.55.0), scales(v.1.1.0), lubridate(v.1.7.4), forcats(v.0.4.0), stringr(v.1.4.0), dplyr(v.0.8.4), purrr(v.0.3.3), readr(v.1.3.1), tidyr(v.1.0.2), tibble(v.2.1.3), ggplot2(v.3.2.1) and tidyverse(v.1.3.0)

loaded via a namespace (and not attached): Rcpp(v.1.0.3), lattice(v.0.20-40), assertthat(v.0.2.1), digest(v.0.6.25), mime(v.0.9), R6(v.2.4.1), cellranger(v.1.1.0), backports(v.1.1.5), reprex(v.0.3.0), evaluate(v.0.14), highr(v.0.8), httr(v.1.4.1), pillar(v.1.4.3), rlang(v.0.4.4), lazyeval(v.0.2.2), readxl(v.1.3.1), rstudioapi(v.0.11), data.table(v.1.12.8), rmarkdown(v.2.1), labeling(v.0.3), htmlwidgets(v.1.5.1), munsell(v.0.5.0), shiny(v.1.4.0), broom(v.0.5.4), httpuv(v.1.5.2), compiler(v.3.6.3), modelr(v.0.1.6), xfun(v.0.12), pkgconfig(v.2.0.3), htmltools(v.0.4.0), tidyselect(v.1.0.0), fansi(v.0.4.1), viridisLite(v.0.3.0), later(v.1.0.0), crayon(v.1.3.4), dbplyr(v.1.4.2), withr(v.2.1.2), grid(v.3.6.3), xtable(v.1.8-4), nlme(v.3.1-144), jsonlite(v.1.6.1), gtable(v.0.3.0), lifecycle(v.0.1.0), DBI(v.1.1.0), magrittr(v.1.5), cli(v.2.0.1), stringi(v.1.4.6), farver(v.2.0.3), promises(v.1.1.0), fs(v.1.3.1), xml2(v.1.2.2), generics(v.0.0.2), vctrs(v.0.2.3), tools(v.3.6.3), Cairo(v.1.5-10), glue(v.1.3.1), hms(v.0.5.3), crosstalk(v.1.0.0), fastmap(v.1.0.1), yaml(v.2.2.1), colorspace(v.1.4-1), rvest(v.0.3.5), knitr(v.1.28) and haven(v.2.2.0)